It is widely believed that the higher the uncertainty of a word in a caption, the more inter-correlated contextual information is required to determine it. However, current image captioning methods usually generate all words in a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework that iteratively and in parallel inserts discontinuous candidate words between existing words, from easy to difficult, until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark show that our approach outperforms a strong baseline and related methods in both captioning quality and decoding speed.
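A minimal sketch of the easy-to-difficult parallel insertion loop described above (the `model` interface and the special ids `BOS_ID`, `EOS_ID`, `NO_INSERT_ID` are hypothetical; the paper's actual decoder, uncertainty measure, and beam search are more involved):

```python
import torch

BOS_ID, EOS_ID, NO_INSERT_ID = 1, 2, 0   # hypothetical special token ids

def insertion_decode(model, image_feats, max_rounds=10, conf_threshold=0.5):
    """Easy-to-difficult parallel insertion decoding (illustrative).

    `model(image_feats, tokens)` is assumed to return logits of shape
    (num_gaps, vocab_size): one word distribution per gap between
    existing tokens. NO_INSERT_ID means "leave this gap empty".
    """
    tokens = [BOS_ID, EOS_ID]                     # start from an empty caption
    for _ in range(max_rounds):
        logits = model(image_feats, torch.tensor(tokens))
        conf, words = logits.softmax(-1).max(-1)
        # commit only low-uncertainty words this round; harder words wait
        # until more surrounding context has been filled in
        inserts = [(g, w.item()) for g, (c, w) in enumerate(zip(conf, words))
                   if c >= conf_threshold and w.item() != NO_INSERT_ID]
        if not inserts:                           # converged
            break
        for g, w in sorted(inserts, reverse=True):  # right-to-left keeps indices valid
            tokens.insert(g + 1, w)
    return tokens
```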
Recently, vector-quantized autoregressive (VQ-AR) models have shown remarkable results in text-to-image synthesis by predicting discrete image tokens uniformly from top left to bottom right in the latent space. Although this simple generative process works surprisingly well, is it the best way to generate an image? For instance, human creation tends to proceed from the outline of an image to its fine details, whereas VQ-AR models do not consider the relative importance of any component. In this paper, we present a progressive denoising model for high-fidelity text-to-image generation. The proposed method creates new image tokens from coarse to fine based on the existing context in a parallel manner, and this procedure is applied recursively until the image sequence is complete. The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable. Extensive experiments demonstrate that the progressive model produces significantly better FID scores than the previous VQ-AR method across a wide variety of categories and aspects. Moreover, the text-to-image generation time of traditional AR models increases linearly with the output image resolution, making it quite time-consuming even for normal-size images. In contrast, our approach achieves a better trade-off between generation quality and speed.
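The coarse-to-fine parallel procedure can be illustrated with a MaskGIT-style refinement loop; this is a hedged sketch under the assumption that the model re-predicts all uncommitted latent tokens each round (the `model` interface, `MASK_ID`, and the linear commit schedule are illustrative, not the paper's exact algorithm):

```python
import torch

MASK_ID = 8192  # hypothetical id of the special [MASK] latent token

def progressive_generate(model, text_emb, seq_len=256, steps=8):
    """Coarse-to-fine parallel token generation (illustrative).

    `model(text_emb, tokens)` is assumed to return logits of shape
    (seq_len, vocab_size) for every position, conditioned on the text.
    Each round commits the most confident predictions first and
    re-predicts the rest, instead of one left-to-right pass.
    """
    tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(text_emb, tokens)
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(tokens != MASK_ID, -1.0)  # keep committed tokens

        # linear schedule: commit a growing share of positions per round
        committed = int((tokens != MASK_ID).sum())
        n_commit = max(1, (seq_len * (step + 1)) // steps - committed)
        top = conf.topk(n_commit).indices
        tokens[top] = pred[top]
        if (tokens != MASK_ID).all():
            break
    return tokens
```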
Panoptic Narrative Grounding (PNG) is a new task whose goal is to segment visual objects of both thing and stuff categories described by the dense narrative caption of a still image. Previous two-stage methods first extract segmentation region proposals with an off-the-shelf panoptic segmentation model, then conduct coarse region-phrase matching to ground the candidate regions for each noun phrase. However, the two-stage pipeline is usually limited by the performance of the low-quality proposals from the first stage, as well as by the loss of spatial detail caused by region feature pooling and by the complex strategies designed separately for thing and stuff categories. To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentations by simple combination. Our model can thereby exploit sufficient and finer cross-modal semantic correspondence from the supervision of densely annotated pixel-phrase pairs rather than sparse region-phrase pairs. In addition, we propose a Language-Compatible Pixel Aggregation (LCPA) module to further enhance the discriminative ability of phrase features through multi-round refinement, which selects the most compatible pixels for each phrase to adaptively aggregate the corresponding visual context. Extensive experiments show that our method achieves new state-of-the-art performance on the PNG benchmark with a 4.0 absolute Average Recall gain.
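The core one-stage idea, scoring each phrase against every pixel, can be sketched as a normalized dot product between phrase embeddings and the pixel feature map; this is an illustrative simplification, not the authors' PPMN implementation:

```python
import torch
import torch.nn.functional as F

def pixel_phrase_matching(pixel_feats, phrase_feats):
    """Match phrases directly to pixels (illustrative sketch).

    pixel_feats:  (C, H, W) feature map from a visual encoder
    phrase_feats: (P, C) one embedding per noun phrase
    Returns per-phrase soft masks of shape (P, H, W).
    """
    C, H, W = pixel_feats.shape
    pixels = F.normalize(pixel_feats.reshape(C, H * W), dim=0)   # (C, HW)
    phrases = F.normalize(phrase_feats, dim=-1)                  # (P, C)
    sim = phrases @ pixels                                       # (P, HW) cosine scores
    return sim.sigmoid().reshape(-1, H, W)                       # (P, H, W)

# usage with random features
masks = pixel_phrase_matching(torch.randn(256, 32, 32), torch.randn(5, 256))
```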
Existing image captioning methods typically generate sentences word by word from left to right, constrained by local context that includes the given image and the history of generated words. Many studies aim to exploit global information during decoding, e.g., through iterative refinement. However, how to effectively and efficiently incorporate future context remains under-explored. To answer this question, inspired by non-autoregressive image captioning (NAIC), which can exploit two-sided relations via a modified mask operation, we aim to graft this advance onto conventional autoregressive image captioning (AIC) models while maintaining inference efficiency without extra time cost. Specifically, the AIC and NAIC models are first combined through a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; the AIC model is then encouraged to capture the causal dynamics that the NAIC model exchanges across layers on its unconfident words, following a teacher-student paradigm optimized with a distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluation on the MS COCO benchmark. The source code is available at: https://github.com/feizc/future-caption.
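The distribution calibration objective on unconfident words might look like the following hedged sketch: a plain KL-based teacher-student loss applied only where the student is unsure (the paper's cross-layer interchange mechanism is not reproduced here, and the confidence threshold is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def calibration_loss(aic_logits, naic_logits, conf_threshold=0.5):
    """The AIC student mimics the NAIC teacher's word distribution only on
    positions where the student itself is unconfident (illustrative).

    aic_logits, naic_logits: (seq_len, vocab_size)
    """
    student_log_probs = F.log_softmax(aic_logits, dim=-1)
    teacher_probs = F.softmax(naic_logits, dim=-1).detach()  # no grad to teacher

    # a word is "unconfident" if the student's top probability is low
    unconfident = student_log_probs.exp().max(dim=-1).values < conf_threshold

    kl = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(-1)
    return (kl * unconfident.float()).sum() / unconfident.float().sum().clamp(min=1)
```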
Referring video object segmentation aims to predict foreground labels for objects referred to by natural language expressions in videos. Previous methods either depend on 3D ConvNets or incorporate an additional 2D ConvNet as an encoder to extract mixed spatio-temporal features. However, these methods suffer from spatial misalignment or false distractors due to the delayed and implicit spatio-temporal interaction that occurs in the decoding stage. To tackle these limitations, we propose a Language-Bridged Duplex Transfer (LBDT) module, which uses language as an intermediary bridge to accomplish explicit and adaptive spatio-temporal interaction early in the encoding stage. Concretely, cross-modal attention is performed among the temporal encoder, the referring words, and the spatial encoder to aggregate and transfer language-relevant motion and appearance information. In addition, we propose a Bilateral Channel Activation (BCA) module in the decoding stage to further denoise and highlight spatio-temporally consistent features via channel-wise activation. Extensive experiments show that our method achieves new state-of-the-art performance on four popular benchmarks, with 6.8% and 6.9% absolute AP gains on A2D Sentences and J-HMDB Sentences respectively, while consuming about 7x less computational overhead.
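How language can serve as an intermediate bridge between the temporal and spatial encoders can be sketched with two cross-attention hops; the module names and residual wiring below are illustrative assumptions, not the official LBDT code:

```python
import torch
import torch.nn as nn

class LanguageBridge(nn.Module):
    """Words first gather motion cues from the temporal encoder, then hand
    them to the spatial encoder (hypothetical simplification of LBDT)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.word_to_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_to_word = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, temporal_feats, spatial_feats):
        # words: (B, L, C); temporal/spatial feats: (B, N, C) flattened tokens
        motion_words, _ = self.word_to_motion(words, temporal_feats, temporal_feats)
        bridged, _ = self.space_to_word(spatial_feats, motion_words, motion_words)
        return spatial_feats + bridged   # residual: spatially aligned, motion-aware

bridge = LanguageBridge()
out = bridge(torch.randn(2, 10, 256), torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```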
Humanitarian organizations must have fast and reliable data to respond to disasters. Deep learning approaches are difficult to implement in real-world disasters because collecting ground-truth data on damage conditions (training data) soon after the event can be challenging. In this work, the recent self-paced positive-unlabeled (PU) learning approach is demonstrated by successfully applying it to building damage assessment with very limited labeled data and a large amount of unlabeled data. Self-paced PU learning is compared with supervised baselines and traditional PU learning on datasets collected from the 2011 Tohoku earthquake, the 2018 Palu tsunami, and 2018 Hurricane Michael. By utilizing only a portion of the labeled damaged samples, we show how models trained with self-paced PU techniques can achieve performance comparable to supervised learning.
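Self-paced PU training builds on a PU risk estimator; below is a sketch of the non-negative PU loss (Kiryo et al., 2017) that such methods commonly use, with the class prior as an assumed hyperparameter. The self-paced variant would additionally admit unlabeled samples into training gradually, from high to low confidence:

```python
import torch
import torch.nn.functional as F

def nnpu_loss(scores_pos, scores_unl, prior=0.3):
    """Non-negative PU risk (illustrative). `prior` is the assumed fraction
    of positives (damaged buildings) hidden in the unlabeled pool.

    scores_pos: logits for labeled positive samples
    scores_unl: logits for unlabeled samples
    """
    loss_pos = F.binary_cross_entropy_with_logits(scores_pos, torch.ones_like(scores_pos))
    loss_pos_as_neg = F.binary_cross_entropy_with_logits(scores_pos, torch.zeros_like(scores_pos))
    loss_unl_as_neg = F.binary_cross_entropy_with_logits(scores_unl, torch.zeros_like(scores_unl))
    # estimate the negative risk by subtracting the positive contamination
    neg_risk = loss_unl_as_neg - prior * loss_pos_as_neg
    return prior * loss_pos + neg_risk.clamp(min=0.0)   # non-negative correction
```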
This paper focuses on designing efficient models with low parameter counts and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetV2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that a specific instantiation is very important to model performance even though the same framework is shared. Motivated by this observation, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependencies and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based solely on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing \textbf{SoTA} CNN-/Transformer-based models while trading off model accuracy and efficiency well.
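A schematic reading of the iRMB, expand, mix locally with a depthwise convolution and globally with self-attention, then project back through an inverted residual, is sketched below; this is an interpretation of the abstract, not the official EMO code:

```python
import torch
import torch.nn as nn

class iRMBSketch(nn.Module):
    """Inverted-residual expand/project pair whose middle op combines a
    depthwise conv (short-distance, CNN-like) with self-attention
    (long-distance, ViT-like). Illustrative only."""
    def __init__(self, dim=64, expand=4, heads=4):
        super().__init__()
        hidden = dim * expand
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.project = nn.Conv2d(hidden, dim, 1)
        self.act = nn.SiLU()

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        h = self.act(self.expand(x))
        h = self.dw(h)                           # local mixing
        t = h.flatten(2).transpose(1, 2)         # (B, HW, hidden)
        t, _ = self.attn(t, t, t)                # global mixing
        h = h + t.transpose(1, 2).reshape(B, -1, H, W)
        return x + self.project(h)               # inverted residual

blk = iRMBSketch()
y = blk(torch.randn(2, 64, 14, 14))
```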
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach, called PIE-QG, uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the resulting question-answer pairs to train a BERT-based state-of-the-art QA system. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed with subjects (or objects) and predicates while objects (or subjects) are treated as answers. Experiments on five extractive QA datasets demonstrate that our technique achieves performance on par with existing state-of-the-art QA systems while being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
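The triple-to-question pairing rule can be illustrated as follows; the wh-templates are hypothetical stand-ins for the paper's actual question formation:

```python
def triples_to_qa(triples):
    """Form synthetic QA pairs from OpenIE triples: ask about one argument
    of <subject, predicate, object> and treat the other as the answer."""
    qa_pairs = []
    for subj, pred, obj in triples:
        qa_pairs.append((f"Who or what {pred} {obj}?", subj))  # subject as answer
        qa_pairs.append((f"{subj} {pred} who or what?", obj))  # object as answer
    return qa_pairs

print(triples_to_qa([("Marie Curie", "discovered", "polonium")]))
# [('Who or what discovered polonium?', 'Marie Curie'),
#  ('Marie Curie discovered who or what?', 'polonium')]
```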
Transformers have achieved impressive success on various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, and such data is usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement yielded by ImageNet pretrained weights significantly degrades when the weights are transferred to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach specifically for medical image classification with a Transformer backbone. Our BOLT consists of two networks, namely the online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network's representation of the same patch embedding tokens under a different perturbation. To maximally exploit the Transformer given limited medical data, we propose an auxiliary difficulty ranking task: the Transformer must identify which branch (i.e., online or target) is processing the more difficult perturbed tokens. Overall, the Transformer endeavours to distill transformation-invariant features from the perturbed tokens so as to simultaneously achieve difficulty measurement and maintain the consistency of self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading, and diabetic retinopathy grading. The experimental results validate the superiority of BOLT for medical image classification compared to ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.
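The online-predicts-target consistency at the heart of BOLT can be sketched as a BYOL-style loss; this is an assumption based on the description above (the difficulty ranking head and the target branch's update rule, commonly an exponential moving average of the online weights in BYOL-like methods, are omitted):

```python
import torch
import torch.nn.functional as F

def bolt_style_loss(online_proj, target_proj):
    """Online branch predicts the target branch's representation of the
    same tokens under a different perturbation (illustrative sketch).

    online_proj, target_proj: (batch, dim) projected representations
    """
    online = F.normalize(online_proj, dim=-1)
    target = F.normalize(target_proj.detach(), dim=-1)  # no gradient to target
    return (2 - 2 * (online * target).sum(-1)).mean()   # MSE of unit vectors

loss = bolt_style_loss(torch.randn(8, 128), torch.randn(8, 128))
```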
Knowledge graph embedding (KGE), which maps entities and relations in a knowledge graph into continuous vector spaces, has achieved great success in predicting missing links in knowledge graphs. However, knowledge graphs often contain incomplete triples that are difficult for KGEs to infer inductively. To address this challenge, we resort to analogical inference and propose a novel, general self-supervised framework, AnKGE, to enhance KGE models with analogical inference capability. We propose an analogical object retriever that retrieves appropriate analogical objects at the entity, relation, and triple levels. In AnKGE, we train an analogy function for each level of analogical inference, which takes the original element embedding from a well-trained KGE model as input and outputs the analogical object embedding. To combine the inductive inference capability of the original KGE model with the analogical inference capability added by AnKGE, we interpolate the analogy score with the base model score and introduce adaptive weights in the score function for prediction. Through extensive experiments on the FB15k-237 and WN18RR datasets, we show that AnKGE achieves competitive results on the link prediction task and performs analogical inference well.
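The score interpolation can be sketched directly; the adaptive weight below is a simple learnable scalar squashed into (0, 1), whereas the paper's weighting scheme may be more elaborate:

```python
import torch

def interpolated_score(base_score, analogy_score, alpha):
    """Blend the well-trained KGE model's score with the analogy score
    using an adaptive weight `alpha` in (0, 1). Illustrative only."""
    return (1 - alpha) * base_score + alpha * analogy_score

raw_weight = torch.nn.Parameter(torch.zeros(1))   # learned during training
alpha = torch.sigmoid(raw_weight)                 # keep the weight in (0, 1)
score = interpolated_score(torch.tensor(0.8), torch.tensor(0.6), alpha)
```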